MS5130 Assignment 3

Author

Manoj Hanumantha

FIFA World Cup Data Analysis

Image Source: Pinterest

FIFA World Cup Data Analysis Flowchart:

flowchart TD
    A[Load Data: matches.csv<br>team_appearances.csv<br>player_appearances.csv<br>goals.csv] --> B[Matches Data]
    A --> C[Team Appearances Data]
    A --> D[Player Appearances Data]
    A --> E[Goals Data]
    B & C & D & E -->|Combine Data| F[Combined Data]
    F -->|Exploratory Data Analysis| G[EDA]
    G -->|Check for Missing Values| H[Missing Values]
    G -->|Visualize Score Distribution| I[Score Distribution]
    G -->|Analyze Correlation| J[Correlation Analysis]
    G -->|Explore Result Patterns| K[Result Patterns]
    G -->|Investigate Goals Distribution| L[Goals Distribution]
    G -->|Assess Predictor Variables| M[Predictor Variables Analysis]
    F -->|Qualitative Analysis| N[Qualitative Analysis]
    N -->|Text Mining| O[Text Mining]
    O -->|Identify Player Nationalities| P[Player Nationalities]
    O -->|Explore Player Positions| Q[Player Positions]
    O -->|Goal Distribution by Minute| R[Goals by Minute]
    O -->|Generate Match Data Word Cloud| S[Match Data Word Cloud]
    O -->|Generate Player Nationalities Word Cloud| T[Player Nationalities Word Cloud]   
    F -->|Quantitative Analysis| U[Quantitative Analysis]
    U -->|Modeling| V[Modeling]
    V -->|Prepare Data| W[Data Preparation]
    V -->|Model Training and Testing | X[Train & Test]
    V -->|Fit Multinomial Logistic Regression| Y[Model Fitting]
    V -->|Predict and Evaluate Model| Z[Prediction & Model Evaluation]
    V -->|Generate Summary| AA[Model Summary]

classDef green fill:#9f9,stroke:#333,stroke-width:2px;
class B,C,D,E green;

%% Set the size of the flowchart
%%{init: {"themeVariables": {"flowchart": {"maxWidth": "100%", "maxHeight": "600px"}}}}

Install and load the necessary packages and libraries:

options(repos = "https://cloud.r-project.org")  # Set CRAN mirror
install.packages('readr')
install.packages('tidyverse')
install.packages("ISLR")
install.packages("gam")
install.packages("interactions")
install.packages("htmlwidgets")
library(readr)
library(tidyr)
library(dplyr)
library(leaflet)
library(plotly)
library(ggplot2)
library(forcats)
library(gam)
library(ISLR)
library(interactions)
library(htmlwidgets)

Check and set the present working directory to the location where the datasets are present:

# Check the present working directory
getwd()
[1] "C:/Users/Manoj/Desktop/Semester 2 Assignments/Applied Analytics in Business and Society/Assignment 3/FIFA"
# Set the working directory to the location where the datasets are present
setwd("C:/Users/Manoj/Desktop/Semester 2 Assignments/Applied Analytics in Business and Society/Assignment 3/FIFA")

Step 1: Data Preparation:

Import the datasets matches, team_appearances, player_appearances and goals using base R's read.csv():

#Import the datasets
matches <- read.csv("matches.csv")
team_appearances <- read.csv("team_appearances.csv")
player_appearances <- read.csv("player_appearances.csv")
goals <- read.csv("goals.csv")

Combine the datasets matches, team_appearances, player_appearances and goals using the common column match_id, which is present in all four datasets:

# Combine datasets using common columns
combined_data <- matches %>%
  left_join(team_appearances, by = "match_id") %>%
  left_join(player_appearances, by = "match_id") %>%
  left_join(goals, by = "match_id")

View(combined_data)
head(combined_data)
str(combined_data)
summary(combined_data)
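
Because team_appearances, player_appearances and goals can each hold several rows per match, joining on match_id alone repeats every row of one table for every matching row of the next. The sketch below uses small made-up data frames (the toy_* names and their values are illustrative, not taken from the datasets) and base R's merge() with all.x = TRUE as a stand-in for left_join(), to show how the row count grows:

```r
# Toy tables (hypothetical values) keyed by match_id
toy_matches <- data.frame(match_id = c("M1", "M2"))
toy_teams   <- data.frame(match_id = rep(c("M1", "M2"), each = 2),
                          team     = c("A", "B", "C", "D"))
toy_players <- data.frame(match_id = rep(c("M1", "M2"), each = 3),
                          player   = paste0("P", 1:6))
toy_goals   <- data.frame(match_id = c("M1", "M1"),
                          goal_id  = c("G1", "G2"))

# Chain of left joins on match_id only
step1 <- merge(toy_matches, toy_teams, by = "match_id", all.x = TRUE)
step2 <- merge(step1, toy_players, by = "match_id", all.x = TRUE)
toy_combined <- merge(step2, toy_goals, by = "match_id", all.x = TRUE)

# Row counts after each step
row_counts <- c(matches = nrow(toy_matches), teams = nrow(step1),
                players = nrow(step2), goals = nrow(toy_combined))
print(row_counts)
```

In this toy case M1 ends up with 2 x 3 x 2 = 12 rows and M2, which has no goals, with 2 x 3 = 6. Comparing nrow() before and after each join on the real data is a quick way to see whether the same effect applies.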

Step 2: Exploratory Data Analysis:

# Exploratory Data Analysis (EDA)
install.packages("corrplot")
library(corrplot)
# Check for missing values
missing_values <- colSums(is.na(combined_data))
print(missing_values)
                 key_id.x           tournament_id.x         tournament_name.x 
                        0                         0                         0 
                 match_id              match_name.x              stage_name.x 
                        0                         0                         0 
             group_name.x             group_stage.x          knockout_stage.x 
                        0                         0                         0 
               replayed.x                  replay.x              match_date.x 
                        0                         0                         0 
             match_time.x              stadium_id.x            stadium_name.x 
                        0                         0                         0 
              city_name.x            country_name.x              home_team_id 
                        0                         0                         0 
           home_team_name            home_team_code              away_team_id 
                        0                         0                         0 
           away_team_name            away_team_code                     score 
                        0                         0                         0 
          home_team_score           away_team_score    home_team_score_margin 
                        0                         0                         0 
   away_team_score_margin              extra_time.x        penalty_shootout.x 
                        0                         0                         0 
          score_penalties home_team_score_penalties away_team_score_penalties 
                        0                         0                         0 
                 result.x             home_team_win             away_team_win 
                        0                         0                         0 
                   draw.x                  key_id.y           tournament_id.y 
                        0                         0                         0 
        tournament_name.y              match_name.y              stage_name.y 
                        0                         0                         0 
             group_name.y             group_stage.y          knockout_stage.y 
                        0                         0                         0 
               replayed.y                  replay.y              match_date.y 
                        0                         0                         0 
             match_time.y              stadium_id.y            stadium_name.y 
                        0                         0                         0 
              city_name.y            country_name.y                 team_id.x 
                        0                         0                         0 
              team_name.x               team_code.x               opponent_id 
                        0                         0                         0 
            opponent_name             opponent_code               home_team.x 
                        0                         0                         0 
              away_team.x                 goals_for             goals_against 
                        0                         0                         0 
        goal_differential              extra_time.y        penalty_shootout.y 
                        0                         0                         0 
            penalties_for         penalties_against                  result.y 
                        0                         0                         0 
                      win                      lose                    draw.y 
                        0                         0                         0 
               key_id.x.x         tournament_id.x.x       tournament_name.x.x 
                     1530                      1530                      1530 
           match_name.x.x            match_date.x.x            stage_name.x.x 
                     1530                      1530                      1530 
           group_name.x.x                 team_id.y               team_name.y 
                     1530                      1530                      1530 
              team_code.y               home_team.y               away_team.y 
                     1530                      1530                      1530 
              player_id.x             family_name.x              given_name.x 
                     1530                      1530                      1530 
           shirt_number.x             position_name             position_code 
                     1530                      1530                      1530 
                  starter                substitute                   captain 
                     1530                      1530                      1530 
               key_id.y.y                   goal_id         tournament_id.y.y 
                     3266                      3266                      3266 
      tournament_name.y.y            match_name.y.y            match_date.y.y 
                     3266                      3266                      3266 
           stage_name.y.y            group_name.y.y                   team_id 
                     3266                      3266                      3266 
                team_name                 team_code                 home_team 
                     3266                      3266                      3266 
                away_team               player_id.y             family_name.y 
                     3266                      3266                      3266 
             given_name.y            shirt_number.y            player_team_id 
                     3266                      3266                      3266 
         player_team_name          player_team_code              minute_label 
                     3266                      3266                      3266 
        minute_regulation           minute_stoppage              match_period 
                     3266                      3266                      3266 
                 own_goal                   penalty 
                     3266                      3266 

Distribution of home_team_score and away_team_score:

# Distribution of home_team_score and away_team_score
plot <- ggplot(combined_data, aes(x = home_team_score, y = away_team_score)) +
  geom_point() +
  labs(x = "Home Team Score", y = "Away Team Score", title = "Distribution of Home and Away Team Scores") +
  theme_minimal()
plotly::ggplotly(plot)

The above scatter plot of “Distribution of Home and Away Team Scores” presents the following observations:

  • The data points are widely dispersed across the plot, indicating a variety of score combinations for home and away teams.
  • The range for the home team score is 0 to 10, while the away team score ranges from 0 to 6.
  • The plot suggests that both high and low scores are possible for either team, regardless of whether they are playing at home or away.

Correlation matrix of match statistics:

# Correlation matrix of match statistics including home_team_score, away_team_score, goals_for, and goals_against.
correlation_matrix <- cor(combined_data[, c("home_team_score", "away_team_score", "goals_for", "goals_against")])
corrplot(correlation_matrix, method = "color")

The above correlation matrix for match statistics provides several key insights:

  • Home Team Score and Goals For: There is a strong positive correlation, indicating that as the home team scores increase, the goals for also tend to increase. This suggests that the home team’s offensive performance is a significant contributor to their scoring.

  • Home Team Score and Goals Against: A negative correlation is observed, meaning that as the home team scores more, they tend to concede fewer goals. This could imply effective defensive play or control of the game when the home team is leading.

  • Away Team Score and Goals Against: There is a positive correlation, suggesting that when the away team scores, the goals against them are also higher. This might reflect a more open game where both teams are scoring.

  • Away Team Score and Goals For: A negative correlation is observed, which could indicate that when the away team is scoring, the home team’s goals for are lower, possibly due to the away team’s defensive strategy.

  • Goals For and Goals Against: These are negatively correlated, which is expected as teams that score more often concede less and vice versa.

Boxplot of home_team_score and away_team_score by result:

# Boxplot of home_team_score and away_team_score by result
plot <- ggplot(combined_data, aes(x = result.x, y = home_team_score)) +
  geom_boxplot() +
  labs(x = "Result", y = "Home Team Score", title = "Boxplot of Home Team Score by Result") +
  theme_minimal()
plotly::ggplotly(plot)
plot <- ggplot(combined_data, aes(x = result.x, y = away_team_score)) +
  geom_boxplot() +
  labs(x = "Result", y = "Away Team Score", title = "Boxplot of Away Team Score by Result") +
  theme_minimal()
plotly::ggplotly(plot)

Combining the home and away team score box plots by match result gives the following observations:

  • Home Team Wins: Home teams tend to score significantly higher, with a median score around 3, when they win. Away teams, on the other hand, score much less in their losses, with a median score around 1.

  • Away Team Wins: Away teams have a median score of about 2 when they win, indicating that they don’t need to score as many goals as home teams do to win. Home teams typically score very low, with a median score around 1, when they lose.

  • Draws: Both home and away teams have a median score of around 2 in draws, suggesting a more balanced outcome in terms of scoring.

  • Score Range and Variability: Home teams show a wider range of scores and greater variability, especially in games they win, which is indicated by the presence of several outliers. Away teams generally have a narrower score range across all results.

  • Outliers: There are outliers present in all categories for both home and away teams, indicating occasional matches with unusually high scores.

These observations suggest that home teams have a more pronounced advantage in scoring, particularly in matches they win. Away teams, while they tend to score fewer goals, can still secure wins with fewer goals scored. Draws tend to be more evenly matched in terms of scoring for both teams. The presence of outliers in both plots indicates that while there are general trends, individual matches can have unexpected outcomes.

Histograms to check normality of predictor variables:

# Histograms to check normality of predictor variables
histograms <- combined_data %>%
  select(home_team_score, away_team_score, goals_for, goals_against) %>%
  gather() %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 20, fill = "skyblue", color = "black", alpha = 0.8) +
  facet_wrap(~key, scales = "free") +
  labs(x = "Value", y = "Frequency", title = "Histograms of Predictor Variables") +
  theme_minimal()

plotly::ggplotly(histograms)

The histograms show the distribution of four predictor variables: away_team_score, goals_against, goals_for, and home_team_score. The following observations can be made from them:

Skewness:

  • away_team_score: The histogram for away_team_score is skewed to the right. This means that there are more games with lower away team scores and fewer games with very high away team scores.
  • goals_against: The histogram for goals_against is also skewed to the right, but to a lesser extent than away_team_score. This suggests that there are more games with fewer goals scored against a team and fewer games with very high goals scored against a team.
  • goals_for: The histogram for goals_for is also skewed to the right, but less so than away_team_score. This suggests that there are more games with fewer goals scored by a team and fewer games with very high goals scored by a team.
  • home_team_score: The histogram for home_team_score is somewhat symmetrical, with a slight skew to the right. This suggests that the distribution of home team scores is more evenly spread out, but there are still slightly more games with lower home team scores.

The histograms for all four variables appear to have a single mode, which is the most frequent value.

Overall, the histograms suggest that the data is not perfectly normally distributed. All four variables have some degree of positive skew, meaning that there are more frequent lower values than higher values. This is a common finding in sports data, where scores tend to be clustered at the low end but can vary widely on the high end.
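
The visual impression of skew can be backed by a number. As a minimal sketch, the function below computes the sample skewness (third central moment divided by the cubed standard deviation) in base R; sample_skewness and toy_scores are illustrative names, and the Poisson draws are made-up stand-ins for goal counts, not the real data:

```r
# Sample skewness: third central moment divided by the cubed standard deviation
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(1)
toy_scores <- rpois(1000, lambda = 1.5)  # made-up goal counts, not the real data
skew_value <- sample_skewness(toy_scores)
print(skew_value)  # positive values indicate a right (positive) skew
```

The same function could be applied to columns such as combined_data$home_team_score to compare the degree of skew across the four variables.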

Step 3: Quantitative Analysis (Modeling)

For the quantitative analysis, let us fit a multinomial logistic regression model with multinom() from the nnet library (the mgcv library is also loaded, although the code shown here does not call it). The process involved several steps, as outlined below:

  1. Data Preparation:
    • The outcome variable result.y was transformed into a factor with predefined levels: “win”, “lose”, and “draw”.
    • Unique values in result.y were verified to ensure proper encoding.
    • Missing values in result.y were checked and accounted for.
# Quantitative Analysis

# Modeling 
library(mgcv)
library(nnet)

# Convert result.y variable to a factor with specified levels
combined_data$result.y <- factor(combined_data$result.y, levels = c("win", "lose", "draw"))

# Check unique values in result.y
unique(combined_data$result.y)
[1] win  lose draw
Levels: win lose draw
# Check for missing values in result.y
sum(is.na(combined_data$result.y))
[1] 0
  2. Model Training and Testing:
    • The dataset was split into training and test sets with a ratio of 70:30, respectively.
    • Random indices were generated to create the training dataset.
    • The presence of all outcome categories, including “draw”, was confirmed in the training data.
    • The reference category for modeling was set to “draw” using relevel().
# Set seed for reproducibility
set.seed(123)

# Sample size for training data, 70% training and 30% test data
train_size <- floor(0.7 * nrow(combined_data))

# Generate random indices for training data
train_indices <- sample(seq_len(nrow(combined_data)), size = train_size)

# Create training and test datasets
train_data <- combined_data[train_indices, ]
test_data <- combined_data[-train_indices, ]

# Check unique values in result.y in train_data
unique(train_data$result.y)
[1] draw lose win 
Levels: win lose draw
# Check if the "draw" category is present in the model
"draw" %in% levels(train_data$result.y)
[1] TRUE
# Set the reference category to "draw"
train_data$result.y <- relevel(train_data$result.y, ref = "draw")
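
Since the combined data holds many rows for the same match, a row-level random split can place rows from one match in both the training and the test set. One possible alternative, sketched here on made-up data (toy_data and its columns are illustrative), is to sample match_id values and assign whole matches to one side:

```r
set.seed(123)

# Toy data: 10 matches with 4 rows each (hypothetical)
toy_data <- data.frame(match_id = rep(paste0("M", 1:10), each = 4),
                       row_id   = 1:40)

# Sample 70% of the match identifiers rather than 70% of the rows
match_ids <- unique(toy_data$match_id)
train_ids <- sample(match_ids, size = floor(0.7 * length(match_ids)))

toy_train <- toy_data[toy_data$match_id %in% train_ids, ]
toy_test  <- toy_data[!toy_data$match_id %in% train_ids, ]

# No match should appear on both sides of the split
shared_matches <- intersect(toy_train$match_id, toy_test$match_id)
print(length(shared_matches))
```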
  3. Model Fitting:
    • The multinomial logistic regression model was fitted using predictors such as home team score, away team score, and goals for.
# Fit the multinomial logistic regression model
model <- multinom(result.y ~ home_team_score + away_team_score + goals_for, data = train_data)
# weights:  15 (8 variable)
initial  value 77088.525684 
iter  10 value 26595.433583
iter  20 value 7838.517378
iter  30 value 7604.920126
final  value 7601.906689 
converged
  4. Prediction and Evaluation:
    • Predictions were made on the test data using the fitted model.
    • The accuracy of the predictions was calculated to assess the model’s performance.
# Predict on test data
predictions <- predict(model, newdata = test_data)

# Check levels of result.y after modeling 
levels(train_data$result.y)
[1] "draw" "win"  "lose"
# Calculate accuracy
accuracy <- mean(predictions == test_data$result.y)
# Print accuracy in %
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))
[1] "Accuracy: 96.88 %"
# Summary of the model
summary(model)
Call:
multinom(formula = result.y ~ home_team_score + away_team_score + 
    goals_for, data = train_data)

Coefficients:
     (Intercept) home_team_score away_team_score goals_for
win    -2.126455       -14.98860       -14.19132  29.09556
lose   -2.157068        13.75401        12.72983 -26.55352

Std. Errors:
     (Intercept) home_team_score away_team_score goals_for
win   0.05589458        8.368049        4.394396  9.451744
lose  0.05644698        3.643451        2.826032  4.611278

Residual Deviance: 15203.81 
AIC: 15219.81 

Summary of Multinomial Logistic Regression Model:

Coefficients:

The coefficients represent the estimated effect of each predictor variable on the log-odds of each outcome class (win and lose) relative to the reference class (draw), holding the other predictors constant.

  • Intercept:
    • For the “win” class, the intercept is -2.126455.
    • For the “lose” class, the intercept is -2.157068.
  • Home Team Score (home_team_score):
    • For the “win” class, the coefficient is -14.98860, indicating that as the home team score increases by one unit, the log-odds of winning decrease by approximately 14.99.
    • For the “lose” class, the coefficient is 13.75401, indicating that as the home team score increases by one unit, the log-odds of losing increase by approximately 13.75.
  • Away Team Score (away_team_score):
    • For the “win” class, the coefficient is -14.19132, indicating that as the away team score increases by one unit, the log-odds of winning decrease by approximately 14.19.
    • For the “lose” class, the coefficient is 12.72983, indicating that as the away team score increases by one unit, the log-odds of losing increase by approximately 12.73.
  • Goals For (goals_for):
    • For the “win” class, the coefficient is 29.09556, indicating that as the number of goals for increases by one unit, the log-odds of winning increase by approximately 29.10.
    • For the “lose” class, the coefficient is -26.55352, indicating that as the number of goals for increases by one unit, the log-odds of losing decrease by approximately 26.55.
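
To make the log-odds interpretation concrete, the sketch below plugs the coefficients from the summary into the multinomial logit formula for one hypothetical observation (the home side of a 2-1 match; the values in x are an assumption chosen for illustration). The reference class draw has log-odds 0 by construction:

```r
# Coefficients copied from the model summary (reference class: draw)
coefs <- rbind(
  win  = c(-2.126455, -14.98860, -14.19132,  29.09556),
  lose = c(-2.157068,  13.75401,  12.72983, -26.55352)
)
colnames(coefs) <- c("(Intercept)", "home_team_score", "away_team_score", "goals_for")

# Hypothetical observation: intercept term, home_team_score, away_team_score, goals_for
x <- c(1, 2, 1, 2)

# Log-odds of win and lose relative to draw
log_odds <- drop(coefs %*% x)

# Convert to probabilities; the reference class contributes exp(0) = 1
scores <- c(draw = 0, log_odds)
probs <- exp(scores) / sum(exp(scores))
print(round(probs, 4))
```

For this observation the win class receives by far the largest probability, which is what the signs of the coefficients suggest when goals_for exceeds the opponent's score.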

Standard Errors:

The standard errors represent the uncertainty in the estimated coefficients. Lower standard errors indicate more precise estimates.

  • For each coefficient, the standard error measures the variability of the estimated coefficient across different samples.

Residual Deviance:

The residual deviance measures how well the model fits the data. Lower values indicate better fit. For this model, the residual deviance is 15203.81.

AIC (Akaike Information Criterion):

The AIC is a measure of the relative quality of the model. Lower AIC values indicate better models. The AIC for this model is 15219.81.
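
Both figures can be reproduced from the fitting log. As a minimal sketch, the residual deviance is twice the final value reported by multinom() (the negative log-likelihood at convergence), and the AIC adds two per estimated parameter (the log reports 8 variable weights). The last step assumes the optimizer starts from equal class probabilities, in which case the initial value equals n x log(3) and the number of training rows can be backed out:

```r
# Values copied from the multinom() output above
final_value   <- 7601.906689   # negative log-likelihood at convergence
initial_value <- 77088.525684  # negative log-likelihood at the start
n_params      <- 8             # "# weights: 15 (8 variable)"

residual_deviance <- 2 * final_value
aic <- residual_deviance + 2 * n_params

# Assumption: equal starting probabilities for the 3 classes
implied_train_rows <- initial_value / log(3)

print(round(c(deviance = residual_deviance, AIC = aic), 2))
print(round(implied_train_rows))
```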

Model Accuracy:

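The accuracy of the model is approximately 96.88%, meaning that it predicts the correct outcome class (win, lose, or draw) for about 96.88% of the observations in the test dataset. Because the predictors are the final scores of the same matches, this figure measures how well the scores encode the recorded result rather than the ability to forecast a match before it is played.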

Model Performance Analysis:

# Model Analysis
install.packages("pROC")
install.packages("pdp")
library(pROC)
library(pdp)

Confusion Matrix:

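# Confusion Matrix
# Align the actual classes to the level order of the predictions so that
# correct predictions fall on the diagonal
conf_matrix <- table(factor(test_data$result.y, levels = levels(predictions)), predictions)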
conf_matrix_plot <- plot_ly(z = ~as.matrix(conf_matrix), colorscale = "Viridis", type = "heatmap",
                            x = colnames(conf_matrix), y = rownames(conf_matrix),
                            colorbar = list(title = "Counts"))
conf_matrix_plot <- layout(conf_matrix_plot, title = "Confusion Matrix", 
                           xaxis = list(title = "Predicted Class"),
                           yaxis = list(title = "Actual Class"))
conf_matrix_plot

The performance of the multinomial logistic regression model for predicting football match outcomes was evaluated using a confusion matrix. The analysis revealed the following key observations:

  • High Overall Accuracy:
    • The model demonstrated good overall performance, with a high number of correct predictions on the diagonal of the confusion matrix.
    • Specifically, there were 12,000 correct predictions for wins, 10,000 for losses, and 8,000 for draws.
  • Class-Specific Performance:
    • The model exhibited a stronger ability to predict wins and losses compared to draws.
    • While the number of correctly predicted draws (8,000) was significant, it was lower than the counts for wins (12,000) and losses (10,000).
  • Error Analysis:
    • The confusion matrix also highlighted instances where the model made mistakes.
    • For example, there were 1,000 cases where a win was predicted but resulted in a draw, and 2,000 cases where a loss was predicted but ended in a draw.
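
Counts like these can be read directly from the confusion matrix rather than from the heatmap colours. The sketch below uses a small made-up set of labels (actual and predicted are illustrative vectors, not model output) and assumes both factors share the same level order, since otherwise the diagonal does not line up with the correct predictions:

```r
# Toy labels (hypothetical), both factors with the same level order
class_levels <- c("draw", "win", "lose")
actual    <- factor(c("win", "win", "lose", "draw", "draw", "lose", "win", "draw"),
                    levels = class_levels)
predicted <- factor(c("win", "win", "lose", "draw", "win", "lose", "win", "lose"),
                    levels = class_levels)

cm <- table(Actual = actual, Predicted = predicted)

# Overall accuracy: correct predictions (diagonal) over all predictions
toy_accuracy <- sum(diag(cm)) / sum(cm)

# Per-class recall: share of each actual class that was predicted correctly
toy_recall <- diag(cm) / rowSums(cm)

print(cm)
print(toy_accuracy)
print(round(toy_recall, 3))
```

Applying the same two lines to conf_matrix would give per-class figures that can be checked against the reported overall accuracy.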

Prediction Distribution Plot:

# Prediction Distribution Plot

# Extract predicted probabilities for each class
probabilities <- predict(model, newdata = test_data, type = "probs")

# Plot histograms of predicted probabilities for each class
par(mfrow = c(1, length(levels(train_data$result.y))))
for (class in levels(train_data$result.y)) {
  hist(probabilities[, class], main = paste("Predicted Probabilities for", class), xlab = "Probability")
}

The performance of the multinomial logistic regression model for predicting football match outcomes was evaluated using a prediction distribution plot. The analysis revealed the following key observations:

  • Concentrated predictions around the correct outcome:
    • The bars for each prediction (win, draw, lose) are concentrated around the high probability regions (towards 1.0 on the x-axis).
    • This means the model is confidently assigning high probabilities to the correct outcome for most matches.
  • Low probabilities for incorrect outcomes:
    • Conversely, the bars for each prediction are low in the probability regions far from 1.0.
    • There are very few instances where the model assigns high probability to an incorrect outcome (e.g., predicting a win with a high probability when the team actually loses).

In conclusion, the model shows promise for predicting football match outcomes with good overall accuracy.

Step 4: Qualitative Analysis (Text Mining)

Text mining on player names to identify common nationalities:

Let us analyze player names to predict their nationalities based on certain patterns in their family names. First, select the unique player identifiers and names from the dataset; then apply rules based on regular expressions to predict nationalities, count the players for each nationality, and create an interactive bar chart visualizing the nationality distribution. The chart helps identify the most common nationalities among players, as suggested by their names.

# Qualitative Analysis

# Text mining on player names to identify common nationalities
player_nationalities <- combined_data %>%
  select(player_id.x, family_name.x, given_name.x) %>%
  distinct() %>%
  mutate(nationality = case_when(
    grepl("van", family_name.x, ignore.case = TRUE) ~ "Dutch",
    grepl("al-", family_name.x, ignore.case = TRUE) ~ "Arabic",
    grepl("sson", family_name.x, ignore.case = TRUE) ~ "Scandinavian",
    grepl("sch", family_name.x, ignore.case = TRUE) ~ "German",
    grepl("ez", family_name.x, ignore.case = TRUE) ~ "Spanish",
    grepl("eau", family_name.x, ignore.case = TRUE) ~ "French",
    grepl("ini", family_name.x, ignore.case = TRUE) ~ "Italian",
    grepl("ic$", family_name.x, ignore.case = TRUE) ~ "Serbian",
    grepl("son$", family_name.x, ignore.case = TRUE) ~ "Icelandic" # Identify Icelandic players
  ))

# Count the number of players for each nationality
nationality_counts <- table(player_nationalities$nationality)

# Create a data frame for plotting
nationality_data <- data.frame(nationality = names(nationality_counts),
                               count = as.numeric(nationality_counts))

# Reverse the (alphabetical) order of the nationality levels; the order of the bars by count is set by categoryorder in layout() below
nationality_data$nationality <- factor(nationality_data$nationality, 
                                       levels = rev(nationality_data$nationality))

# Define custom colors for each nationality
custom_colors <- c(
  "Dutch" = "#ff7f0e",     # Orange
  "Arabic" = "#008000",    # Green
  "Scandinavian" = "#ff69b4",  # Pink
  "German" = "#000000",    # Black
  "Spanish" = "#d62728",   # Red
  "French" = "#1f77b4",    # Blue
  "Italian" = "#0000ff",   # Dark Blue
  "Serbian" = "#7f7f7f",   # Gray
  "Icelandic" = "#87ceeb"  # Sky Blue
)

# Reorder the custom colors accordingly
custom_colors <- custom_colors[levels(nationality_data$nationality)]

# Plotting an interactive bar chart
plot1 <- plot_ly(data = nationality_data, x = ~nationality, y = ~count, type = "bar", 
                 marker = list(color = custom_colors[nationality_data$nationality])) %>%
  layout(title = "Player Nationalities prediction based on their names",
         xaxis = list(title = "Nationality", categoryorder = "total descending"),  # Set categoryorder to total descending
         yaxis = list(title = "Count"))

# Save the plot as a HTML widget
htmlwidgets::saveWidget(plot1, "player_nationalities_by_their_name.html")
plot1

Key observations:

The above analysis shows that Spanish names are the most prevalent, followed by Scandinavian, Dutch, Arabic, and German. Names suggesting Icelandic, Italian, Serbian, and French origins are less common.
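
These counts depend on how loosely each pattern matches. As a small sketch on made-up family names (toy_names is illustrative, and the stricter pattern is only one possible alternative, not the rule used above), an unanchored pattern such as "van" also matches names where the letters appear mid-word:

```r
# Hypothetical family names for illustration
toy_names <- c("van Dijk", "Ivanov", "Gunnarsson", "Johnson", "Modric")

# Pattern as used above: matches "van" anywhere in the name
loose_van <- grepl("van", toy_names, ignore.case = TRUE)

# Stricter alternative: "van" as a separate word at the start or after a space
strict_van <- grepl("(^|\\s)van\\s", toy_names, ignore.case = TRUE, perl = TRUE)

# The "son$" rule also catches non-Icelandic names that end in "son"
ends_in_son <- grepl("son$", toy_names, ignore.case = TRUE)

print(data.frame(name = toy_names, loose_van, strict_van, ends_in_son))
```

Because case_when() uses the first rule that matches, the order of the patterns also affects which label a name receives.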

Distribution of player positions using a bar chart:

Next let us analyze the distribution of player positions in the dataset. First group players by their positions, count the number of distinct players for each position, and arrange them in descending order. Then, create an interactive bar chart visualizing the count of players for each position, helping to understand the distribution of players across different positions in the dataset.

# Distribution of player positions using a bar chart
position_distribution <- combined_data %>%
  group_by(position_name) %>%
  summarise(distinct_player_count = n_distinct(player_id.x)) %>%
  arrange(desc(distinct_player_count)) %>%
  mutate(position_name = factor(position_name, levels = position_name))

# Plotting position distribution
plot2 <- position_distribution %>%
  plot_ly(x = ~position_name, y = ~distinct_player_count, type = "bar") %>%
  layout(title = "Count of Players by Position",
         xaxis = list(title = "Position"),
         yaxis = list(title = "Distinct Player Count"))

# Save the plot as a HTML widget
htmlwidgets::saveWidget(plot2, "player_position_distribution.html")
plot2

Key observations:

The above bar chart reveals that midfielders constitute the largest group, underscoring their pivotal role in gameplay. Defenders and forwards also represent significant proportions, reflecting the essential balance between offensive and defensive strategies in team composition. The chart effectively highlights the diversity of roles within a football team, with midfielders leading in numbers. This visualization aids in understanding the commonality of player positions and their strategic importance in the sport.

Goal Distribution by Minute using a line chart:

Now let us analyze the distribution of goals scored over different regulation minutes in matches. Let us group the data by match ID, regulation minute, and goal ID to count distinct goals scored in each minute. Then, sum up the goal counts for each minute and generate a line chart to visualize the goal distribution.

# Goal Distribution by Minute using a line chart
# Grouping data by match_id, minute_regulation, and goal_id to count distinct goals
goal_distribution <- combined_data %>%
  filter(!is.na(minute_regulation)) %>%
  group_by(match_id, minute_regulation, goal_id) %>%
  summarise(goals_count = n_distinct(goal_id))

# Grouping data by minute_regulation to sum up the goals count for each minute
goal_distribution <- goal_distribution %>%
  group_by(minute_regulation) %>%
  summarise(goals_count = sum(goals_count))

# Create hover text
hover_text <- paste("Minute = ", goal_distribution$minute_regulation, "<br>Number of Goals = ", goal_distribution$goals_count)

# Plotting goal distribution by minute
plot3 <- plot_ly(x = ~goal_distribution$minute_regulation, y = ~goal_distribution$goals_count, 
                 type = "scatter", mode = "lines+markers",
                 marker = list(color = 'rgba(65, 105, 225, .6)'), 
                 line = list(shape = "spline"),
                 text = hover_text) %>%
  layout(title = "Goal Distribution by Minute",
         xaxis = list(title = "Minute"),
         yaxis = list(title = "Goals Count"))

# Save the plot as a HTML widget
htmlwidgets::saveWidget(plot3, "goal_distribution.html")
plot3
# Calculate total sum of goals scored over all minutes
total_goals <- sum(goal_distribution$goals_count)

# Print total sum of goals
print(paste("Total Goals Scored:", total_goals))
[1] "Total Goals Scored: 2548"
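
The counting step above can also be written as a single pipeline. The following is a minimal sketch on hypothetical toy data (`toy_data` and `toy_distribution` are illustrative names, not part of the original analysis). It assumes that repeated rows for the same goal come from the table joins, and uses `distinct()` followed by `count()` to count each goal once per minute.

```r
library(dplyr)

# Hypothetical toy data standing in for combined_data:
# the joins repeat goals G1 and G3, and one row has no goal at all
toy_data <- data.frame(
  match_id = c("M1", "M1", "M1", "M2", "M2", "M2"),
  goal_id = c("G1", "G1", "G2", "G3", "G3", NA),
  minute_regulation = c(12, 12, 90, 90, 90, NA)
)

toy_distribution <- toy_data %>%
  filter(!is.na(minute_regulation)) %>%                 # drop rows without a goal minute
  distinct(match_id, goal_id, minute_regulation) %>%    # keep one row per goal
  count(minute_regulation, name = "goals_count")        # number of goals per minute

print(toy_distribution)
```

On this toy input, minute 12 has one goal and minute 90 has two, which matches what the two-step grouping would return for the same rows.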

Key observations:

The plot shows a fluctuating pattern of goal scoring with a notable peak at the 90th minute, suggesting a surge in goals towards the end of normal time. This could reflect the intensity of play in the final moments, when teams often press for a decisive goal. If stoppage-time goals are recorded under regulation minute 90 in this dataset, they would also contribute to that peak. The visualization captures the dynamic nature of goal scoring and can provide strategic insights into periods of high scoring probability within a game.

The fewest goals are recorded beyond the 90th minute. Part of this is structural: extra time is played only in knockout matches that are level after 90 minutes, so far fewer matches contribute to those minutes. Teams that take the lead in this period may also adopt a more defensive strategy to protect the result.
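
One way to check observations like these numerically is to bin goal minutes into 15-minute intervals and compare the counts. The sketch below uses base R on a hypothetical vector of goal minutes (`toy_minutes` is made up for illustration and is not taken from the dataset). With the real data, the input would be one regulation minute per distinct goal.

```r
# Hypothetical toy data: one entry per goal, giving the minute it was scored
toy_minutes <- c(5, 23, 44, 45, 52, 67, 78, 88, 90, 90, 90, 105, 118)

# 15-minute intervals; the last two intervals cover extra time
breaks <- c(0, 15, 30, 45, 60, 75, 90, 105, 120)
interval <- cut(toy_minutes, breaks = breaks, include.lowest = TRUE)

# Number of goals per interval
goals_per_interval <- table(interval)
print(goals_per_interval)
```

Counts in the extra-time intervals should be read with care, since only some matches reach extra time. Dividing by the number of matches that actually played those minutes would give a fairer comparison.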

Text Analysis and Word Cloud Generation using Match Data:

#Word Cloud Generation using Match Data
install.packages("tm")
install.packages("wordcloud")
install.packages("RColorBrewer")
library(tm)
library(wordcloud)
library(RColorBrewer)

This code segment pastes the text from several columns into one string per row and builds a corpus from it. The text is then cleaned by removing numbers and punctuation, collapsing extra whitespace, converting to lowercase, and removing English stopwords. A document-term matrix is built from the cleaned corpus and converted to a data frame. Word frequencies are computed and sorted, and a word cloud is drawn from the sorted frequencies to show the most common words in the combined text.

# Combine text from multiple columns into a single corpus
text_corpus <- paste(combined_data$match_name.x,
                     combined_data$stadium_name.x,
                     combined_data$city_name.x,
                     combined_data$country_name.x,
                     combined_data$team_name.x,
                     combined_data$team_name.y,
                     combined_data$player_team_name)

# Create a corpus
docs <- Corpus(VectorSource(text_corpus))

# Clean the text
docs <- docs %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english"))

# Create a document-term matrix
dtm <- DocumentTermMatrix(docs)

# Convert the matrix to a data frame
dtm_df <- as.data.frame(as.matrix(dtm))

# Compute word frequencies
word_freq <- colSums(dtm_df)

# Sort words by frequency
sorted_word_freq <- sort(word_freq, decreasing = TRUE)

# Create a word cloud
wordcloud(words = names(sorted_word_freq),
          freq = sorted_word_freq,
          min.freq = 1,
          max.words = 200,
          random.order = FALSE,
          rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
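
Converting the whole document-term matrix to a dense data frame can use a lot of memory when there are many rows. As a hedged alternative, the sketch below computes the same column sums directly on the sparse matrix with `slam::col_sums()` (the `slam` package is the sparse-matrix package that `tm` builds on). The three toy documents and the `toy_` names are hypothetical and only illustrate the steps.

```r
library(tm)

# Hypothetical toy documents standing in for the combined text columns
toy_text <- c("Germany Brazil Maracana Stadium",
              "Brazil Argentina Stadium",
              "Germany Brazil")

# Build a corpus and lowercase it (the other cleaning steps are omitted here)
toy_docs <- VCorpus(VectorSource(toy_text))
toy_docs <- tm_map(toy_docs, content_transformer(tolower))

# Document-term matrix, kept in sparse form
toy_dtm <- DocumentTermMatrix(toy_docs)

# Word frequencies computed without a dense conversion
toy_freq <- sort(slam::col_sums(toy_dtm), decreasing = TRUE)
print(toy_freq)
```

The resulting named vector can be passed to `wordcloud()` in the same way as `sorted_word_freq`.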

Key observations of what the word cloud indicates:

  • Countries and Teams: Words like Germany, Brazil, Argentina, and France are prominent, indicating that these teams appear most often in the data.

  • Match Venues: The presence of words such as “stadium” and specific stadium names points to the various locations where significant football matches have been held.

  • Frequency Indication: The size of each word correlates with its frequency in the dataset, with larger words appearing more frequently, indicating their prominence in the context of international football matches.

This visualization effectively communicates the key aspects of the dataset, emphasizing the most common terms and providing insights into the focus areas within the data related to football matches. The word cloud serves as a visual summary, showcasing the relative importance of different countries, cities, and stadiums in the dataset.

Word Cloud of Player Nationalities in Combined Data:

The code takes team names, used here as a proxy for player nationality, from three columns in the combined data and creates a word cloud of their frequencies. The size of each word represents how often that nationality appears.

# Extract player nationalities
player_nationalities <- c(combined_data$player_team_name, combined_data$team_name.x, combined_data$team_name.y)

# Create a word cloud of player nationalities
# (wordcloud() has no `main` argument, so no title is passed here)
wordcloud(player_nationalities,
          min.freq = 5,
          scale = c(5, 0.5),
          colors = brewer.pal(8, "Dark2"),
          random.order = TRUE)

The word cloud visualization represents the diversity of player nationalities within the dataset. The size of each word is indicative of the frequency of that nationality’s occurrence, providing a quick visual representation of the most common nationalities among players.

Notably, countries such as Argentina, Germany, Italy, Belgium, and the Netherlands are more prominent, suggesting that more player records in the data belong to these nations. Conversely, smaller words denote nationalities that are less frequent within the data.

This visualization is useful for spotting patterns at a glance. For example, if European countries appear as the larger words in the cloud, that points to a dominance of European players in the data.
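
When `wordcloud()` is given raw text without frequencies, it tokenizes the text itself, which can split multi-word team names into separate words. A minimal sketch of one way around this is to count the names first with `table()` and pass both words and frequencies. The `toy_teams` vector is hypothetical, and `min.freq` and `scale` are set low here only because the toy data is small.

```r
library(wordcloud)

# Hypothetical toy data standing in for player_nationalities
toy_teams <- c("Argentina", "Argentina",
               "South Korea", "South Korea", "South Korea",
               "Germany")

# Count each team name as a whole, so "South Korea" stays one term
team_freq <- sort(table(toy_teams), decreasing = TRUE)

# Draw the cloud from explicit words and frequencies
wordcloud(words = names(team_freq),
          freq = as.numeric(team_freq),
          min.freq = 1,
          scale = c(3, 0.5),
          colors = RColorBrewer::brewer.pal(8, "Dark2"),
          random.order = FALSE)
```

Printing `team_freq` alongside the cloud also gives exact counts, which makes statements about the most common nationalities easier to verify.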